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Abstract 

Particle-in-cell (PIC) simulations with Montc-Carlo collisions are used in plasma 
science to explore a variety of kinetic effects. One major problem is the long run- 
time of such simulations. Even on modern computer systems, PIC codes take a 
considerable amount of time for convergence. Most of the computations can be 
massively parallelized, since particles behave independently of each other within 
one time step. Current graphics processing units (CPUs) offer an attractive 
means for execution of the parallelized code. In this contribution we show 
a one-dimensional PIC code running on Nvidia® CPUs using the CUDA"'"'^ 
environment. A distinctive feature of the code is that size of the cells that the 
code uses to sort the particles with respect to their coordinates is comparable 
to size of the grid cells used for discretization of the electric field. Hence, we 
call the corresponding algorithm "fine-sorting". Implementation details and 
optimization of the code are discussed and the speed-up compared to classical 
CPU approaches is computed. 
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1. Introduction 

In the past years, the development of central processing imits (CPUs) for 
consumers changed from just increasing the clock frequency and integrating 
special functions into the core to the design of processors with multiple cores. 
Accordingly, the use of parallel computing becomes more and more important, 
even though most computer programs presently do not take advantage of this 
possibility. One branch that optimized the parallelization perfectly and also 
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optimized their hardware for that purpose is the manufacturing industry of 
graphics cards. Modern 3-D appUcations allow for the parallel computation of 
many independent values, such as pixels on a display. 

Number of scientific efforts that take advantage of the computational power 
of graphics processing units (GPU) is growing. Recently, many publications 
were made where calculations benefit from the speed of graphics processing 
units dJ [2 [31 m El El E] . Main convenience of this approach is a high speed 
multi core processing unit that is comparatively low priced. 

Low pressure plasmas play a major role in a wide range of industrial applica- 
tions, including, among other things, semi-conductor processing, surface coating 
or sterilization processes. Characteristics of these gas discharges, such as pres- 
sure, input power, reactor sizes, length scales of particle collisions or the typical 
length of electric field screening (Debye-length) span several orders of magni- 
tude, depending on the actual application. This makes modeling extremely 
difficult, because many assumptions or simplifications of a specific model are 
usually only applicable to a particular case. 

Fluid models, which consider velocity distributions functions (mostly Maxwellian) 
become improper under certain conditions, due to their nature of using integral 
quantities and principles like diffusion. PIC simulations [H El [10] trace parti- 
cles (as electrons and ions) on their way inside a plasma, being impacted by 
collisions, the electric field and the walls of the reactor. Thereby, distribution 
functions of particles are approximated by PIC simulations. By averaging (time 
or space integrations) over certain quantities, macroscopic and measurable val- 
ues such as particle density, flux and current can be calculated. Since only a 
few assumptions are made, particle-in-cell (PIC) simulations retain most of the 
fundamental, nonlinear effects in a plasma. 

In this article we present a PIC code parallelized for execution on a graphics 
card, using Nvidia's CUDA environment in C. In the second section we give a 
short introduction into particle-in-cell simulations. In section [3] the algorithm 
is explained in detail. In section [3] we run the code under different conditions 
and obtain parameters for an optimized speed-up. 

2. Particle-In-Cell Basics 

Particlc-in-ccU simulations follow the movement of charged particles inside a 
plasma. Actually, every simulated (super) particle represents a certain amount 
of real particles. Particle locations and velocities are defined in continuum space, 
whereas charge density, electric potential and field are defined on a spatial grid 
of size A'gi.itj. This work examines one spatial and three velocity components of 
the particles. 

Figure [l] shows the modules of the PIC algorithm, that are passed through 
every finite time step At. Beginning at the top, charge density is calculated 
by weighting particles on the grid, using a linear interpolation (first order) 
scheme in this work. After that, Poisson's equation is solved and the electric 
field is calculated. Field values on the grid point are interpolated to particle 
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Figure 1: Computing sequence of the particlc-in-cell cycle with Monte-Carlo collisions. 

positions and then particles accelerate and move inside the simulation box. 
Particles that left the finite box are removed from the simulation every At. If 
secondary electrons or electron reflection are featured, electrons are inserted at 
the boundaries. Now collisions take place, handled by a Monte-Carlo module 
and the loop is closed. 

To render an actual plasma behavior, some conditions have to be satisfied. 
The finite grid spacing (distance between two grid points) has to be smaller than 
the Debye-length to resolve field effects on the correct length scale. Additionally, 
time step size has to be small enough to resolve the highest frequency processes 
in the system, which are oscillations with electron plasma frequency. Recently 
it was shown that the number of particles inside a Debye-sphere has to be larger 
than expected until then to reduce or avoid spurious numerical effects [TT]. All 
this leads to an enormous amount of computational power, needed by such 
simulations. 

3. Implementation Details 

3.1. GPU Programming with CUDA 

Nvidia's Compute Unified Device Architecture (CUDA) is an architecture 
for parallel computations on graphics processing units. Programs arc written in 
the programming language C as regular C-code with some extensions to provide 
access to the GPU [T^ . 

Graphics cards have a number of parallel multiprocessors, depending on the 
generation and model of the GPU. A multiprocessor features a certain amount 
of processor cores, each having its own fast register memory, but the cores of 
a multiprocessor also share the on-chip shared memory space. All different 
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Figure 2: Fine-sorted particle array. Each sorting cell i has a fixed size A^max and is filled 
with particle data up to a certain number A'^^. Information about the current status Ni of a 
sorting cell has to be stored in a separate (integer) array. 



multiprocessors have access to the large global memory. Additionally, there are 
two cached read-only memory spaces, constant memory and texture memory. 

Special functions, called kernels, run on the GPU in parallel, using thousands 
or millions of independent threads. Threads are grouped into thread blocks, 
whose size is chosen by the programmer. All threads of such a block run on the 
same physical multiprocessor, thus having access to the same shared memory 
space. Constant and texture memory are read-only spaces for threads and can 
be exclusively filled with data from outside a kernel by special functions. 

3.2. Data Structure 

Particle data is stored in a float4 array, using the x, y and z component 
for velocities and the w component for the location of a particle. Each particle 
species (such as electrons or ions) uses its own array. All arrays are divided 
into sorting cells, each of the same fixed size (figure [2]). Each sorting cell in 
GPU memory belongs to a certain region in configuration space, accordingly 
all particles within a sorting cell are close to each other. Information on how 
many particles Ni currently reside in a sorting cell is stored in an integer array, 
one integer for each cell. All sorting cells therefore are partially filled with the 
currently residing particles and the rest is free for new particles entering the 
ceh. 

Such sorting cells in memory can contain multiple grid points of the particle- 
in-cell spatial grid, that defines the values of charge density, potential and field. 
The number of grid points per sorting cell N^^ can be configured by the user at 
the beginning of a simulation and infiuences the speed. 

If a kernel is called, each sorting cell is handled by a single block of threads 
(and thus a single multiprocessor, figure [s]). Using the information from the 
integer array (current number of particles inside a sorting cell), threads of a 
block can treat (e.g. push) all valid particles of a sorting cell and leave the 
remaining parts of the fioat4 array unaffected. 

The advantage of a fine-sorting code is, that this kind of code allows for 
a straightforward and fast implementation of nonlinear collision processes (i.e. 
Goulomb collisions) through the binary collisions, which is difficult otherwise. 
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Figure 3: Calling a kernel function in the fine-sorted algorithm. Each cell is handled by a 
single thread block and thus by a single multiprocessor. 



In this case, size of the sorting cells must not exceed the Debye-length. Addi- 
tionally, other modules of the PIC cycle benefit from the fine-sorting approach, 
as stated in the following. A good overview of advantages and drawbacks of 
different ways of GPU implementations can be found in 13] . It is worth noting 
that unlike the codes on CPUs, such as flj, we did not observe any significant 
improvement of the code run-time compared to an alternative GPU algorithm, 
where particle data is unsorted, as will be described in our future work. 

3.3. Normalization 

Equations are normalized to minimize the amount of additional factors and 
thereby reduce the computational effort by the code. Finite difference integra- 
tion of the equations of motion lead to multiplications with the time step Ai, so 
normalizing time t on this time step leads to the simplest case of a multiplication 
with 1 

Determination of the nearest grid point pi of a particle is an important part 
of all grid dependent calculations. Normalizing the spatial coordinate x on the 
lattice spacing Aa; 

X — >■ Ax • X 

involves a single type casting operation (float to integer) to calculate pi. 

To simplify the charge density assignment, charge density p is normalized 
on the charge density of a single super-particle 

. Q - 

p-'a;k-x-P' 

with Q, the charge of super-particle and the electrode area of the discharge. 
Electric potential $ is normalized to ease Poisson's equation 
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Figure 4: Ions (red) and electrons (blue) of one sorting cell only contribute to the charge 
density of grid points that belong to their cell (linear weighting). Each sorting cell contains 5 
grid points in this case. 



with eo, the the vacuum permittivity. Electric field is normalized on 



E —V • E 

Co Ac 



and velocities are normalized on 

Aa; 
V — > -r- ■ V. 
At 

Skipping all tildes for normalized values, the equations read 



= -p{r,z) (1) 



dx 



dt ~" 

with M the mass of a super-particle. 



(4) 



3.4^. Charge Density Assignment 

Before the kernel is started, the number of particles in each sorting cell Ni 
is copied to constant memory. All threads of a block treat particles of a single 
sorting cell. Therefore particles of one block only contribute to a small area in 
the charge density array, namely A'gc grid points belonging to that sorting cell 
(figure |4]). 

Threads allocate a local array of iVgc floats in register memory and initialize 
the array with zeros. Afterwards each thread starts loading a particle to register 
memory, adding its charge to the local charge density array. The i-th thread 
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Figure 5: Charge density calculation. Threads sum rho in register memory, exemplary shown 
for only four threads and a single sorting cell i. The reduction is done using shared memory, 
the last active thread stores the result in global memory. 



of each block thereby executes the i-th particle of the sorting cell. If there are 
more particles than the size of the block, this procedure is repeated in strides 
equal to the block size as many times as needed. 

Once all particles are done, a reduction in shared memory starts. Each 
thread copies its whole local array to a shared memory array. Now half of the 
threads becomes inactive. Each active thread adds the charge density of an 
inactive thread to its own shared array. Again and again half of the threads 
becomes inactive, until only a single thread is left. This thread copies the 
reduced charge density of the current cell to global memory. The very left 
and right grid points of a sorting cell are used by the cell and the neighboring 
cell, respectively. To avoid global memory atomics, the right grid point of each 
sorting cell is stored separately in a float array in global memory and is added 
to the charge density in a subsequent kernel function. The whole procedure is 
shown in figure [Sj 

3.5. Field Solver 

Discretizing Eq.Q with finite differences leads to a tridiagonal matrix. A 
detailed description of including different boundary conditions can be found in 
[15]. Solving can be done very efficiently serially on the CPU, using the Thomas 
algorithm |16j . an optimized Gaussian elimination scheme for tridiagonal ma- 
trices. 

Since charge density is calculated on the GPU, the array must be copied to 
CPU memory, Poisson's equation is solved and the electric field is calculated. 
After this, the electric field array is copied to GPU constant memory. Despite 
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the amount of memory transfer between CPU and GPU, this module takes only 
a small part of the run-time of the PIC cycle. For one million electrons and ions, 
respectively, and 800 grid points for the field, data copy and solving takes about 
3.55% of the run-time of a cycle. Thus, a more complicated parallel approach 
that runs completely on the GPU does not seem to be worth it for a one- 
dimensional simulation. For large number of grid points or a two-dimensional 
approach, this is of course not the case. 



3.6. Particle Pusher 

In this kernel, the current number of particles in each sorting cell iV^ and 
the electric field are read from constant memory. Each thread loads a particle 
to register memory and updates the particle velocity by multiplying the inter- 
polated field with K (Eq.([3|). After this, the new particle location is calculated 
(Eq.Q), using the new velocity. This relates to the commonly used leap-frog 
algorithm, where location and velocity are separated in time by half of a time 

step miH]. 

v{t + 0.5) ~v{t~ 0.5) = kE {x (t)) 
x{t + l) -x{t) = u(t + 0.5) 

The result is stored in global memory. Just as for charge density assignment, 
threads will load additional particles, if the number of particles exceeds the 
block size. 



3.7. Fine-Sorting Algorithm 

After each pushing, some particles reside in incorrect sorting cells. Since the 
Courant-condition H] needs to be satisfied, this is only a small fraction of 
particles. 

Ax 

< ^ < A^ec -ir- (5) 

At ^ At ^ ' 

Vc is the thermal velocity of electrons, so inequality ^ basically means, that 
only a small number of particles leaves its sorting cell during one time step. 
Accordingly, sorting is optimized for a "nearly sorted" particle array. 

In a first kernel, threads copy incorrect particles to free positions at the end 
of their current sorting cell. This memory space serves as a buffer for sorting 
particles (see Fig. Since particles of one sorting cell are checked in parallel 

by threads of a single block, all threads have access to the same shared memory. 
Hence, a counter Ajndex, i ur shared memory can provide indices for the new 
particle positions at the end of the sorting cell. If a thread spots an invalid 
particle, it decreases A^indox, i by one, using the atomicSub function and thereby 
gets the new index. 

After this, each sorting cell has gaps (blue symbols in Fig.(|6|) which have to 
be filled with valid particles. Threads scan particles a second time and fill the 
gaps with information from the last valid particle in each sorting cell, respec- 
tively (Fig.([7|). Therefore a second counter in shared memory allocates correct 
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Figure 6: Step la of the sorting algorithm for sorting cell i. Rejected particles (blue symbols) 
move to free positions (red crosses) at the end of sorting cell i. 
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Figure 7: Step lb of the sorting algorithm for sorting cell i. Gaps are closed with valid 
particles from current sorting cell i. 



9 



sorting sorting sorting 
cell i-1 cell i cell i+1 




Figure 8: Step 2 of the sorting algorithm. Threads of block i copy rejected particles from the 
neighboring sorting cells to cell i. 

particle's indices at position Ni, Ni — 1 and so on. These two steps can be 
included in the particle pusher, to re-use particle data efficiently and thereby 
reduce the memory transfer from global memory. The number of rejected par- 
ticles in each sorting cell and the new number of valid particles are stored 
in integer arrays in global memory. 

The second step of sorting starts in a new kernel. Again, a thread block 
is started for each sorting cell i. Now, half of the threads checks the rejected 
particles of the left neighbor i — I, the other half checks the right neighbor 
i + 1 (figure |8| . A counter in shared memory defines the new particle index in 
sorting cell i and is incremented with the atomicAdd function. Particles from 
the neighboring sorting cells are just read, so no atomic operations are needed 
in global memory. The updated number of valid particles is stored in global 
memory. 

The last kernel of the sorting algorithm has to identify rejected particles, 
that are not sorted into the correct sorting cells yet. These are rare particles 
that moved farther than one sorting cell. Threads of a thread block now check 
rejected particles of its own sorting cell. If such a fast particle is located, its 
information is copied to the correct sorting cell (Fig.Q). For assigning indices 
to the particle, a counter in global memory is needed. This is the sole exception 
for the use of atomic functions in global memory. Since threads from all blocks 
access the same indices this step has to be done in global instead of shared 
memory. Again, the number of particles is stored in global memory. After this. 
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Figure 9: Step 3 of the sorting algorithm. Threads of bloclc i copy rejected particles from 
their own sorting cell (sc) to other cells using atomic functions in global memory for the index 
counter. 
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all particles are arranged in correct sorting cells. 

3.8. Monte-Carlo Collisions 

Monte-Carlo collisions are treated by a modified null collision method. A 
detailed description of the standard algorithm can be found in [171 HH] • 

Here, null collision frequency i^o is calculated once in the usual way, but 
instead of picking 

TVo- (l-e-^*''«)iVp = PoA^P (6) 

particles randomly from the total iVp number of particles, a random number p 
is drawn for every particle. This number is compared with Pq: the null collision 
probability. For p < Pg the regular null collision algorithm is started, that is 
calculating collision cross sections and probabilities for the current particle. At 
an average A^o particles run through the standard null collision algorithm, which 
satisfies equation |6] Using this modified null collision method, global memory 
atomics can be avoided, because every particle is just treated by one individual 
thread. If a particle is created due to ionization, it is injected into the sorting 
cell of the colliding electron. Thus, a counter in shared memory can allocate 
indices for those new particles. 

Consequently, each thread needs its own random number generator (RNG). 
In principal, every RNG that uses only a single seed is sufficient, because mem- 
ory access for initializing the RNG has to be small for a fast implementation. 
In this work we use a 3-xor-shift (11,7,12) generator [12], which can produce 
random numbers very quickly on the GPU. An unsigned integer array allocates 
seeds. Each thread loads its own seed from global memory, generates a certain 
number of random values and stores the last seed back to global memory. 

CPU simulations usually use look-up tables for collision cross sections. On 
the GPU, simple functions, using only products and sums, can be much faster 
than scattered reads from memory. Therefore, cross sections for argon are cal- 
culated directly, using approximations of measured values [501 IHl 122 • 

Even though this implementation creates lots of branches, it is much faster 
than CPU collision handling. 

4. Optimizing the Parameters and Speed-up Measurements 

Variation of Grid Points per Sorting Cell 
The algorithm sorts particles into sorting cells, which can contain any num- 
ber of grid points of the charge density or rather the electric field array (figure 
|4]). At the smallest possible size, there are just two grid points per sorting cell, 
one on the left, one on the right. The number of particles inside each sorting cell 
is low and all kernels have to be called for a large number of cells. It is obvious, 
that overhead due to calling functions and kernels slows down this approach. 
On the other hand, the reduction algorithm is not suitable for a large number 
of grid points per sorting cell A'gc and also the amount of register memory can 
be limiting. Thus, the GPU may run slower than for smaller sorting cells. Also, 
there is an upper boundary on A'gc governed by the requirement that sorting cell 



12 



Q. 

B 
E 



03 

a. 




3 4 5 

grid points per sorting cell 



Figure 10: Execution time of a single cycle for different numbers of grid points per sorting cell 
Ngc- Simulation runs with 500,000 ions and electrons, respectively at a pressure of 10 Pa. 



size does not exceed the Debye-length, in case one wants to implement Coulomb 
collisions. 

Between these to extremes one can suspect an optimized version. Figure 
10 shows the run-time of a single cycle of the simulation as a function of A^gc- 
In fact, a minimum can be found for A^gc ~ 3. Note, that for some cases the 
minimum is found at N^^ = 4, depending on the generation of GPU and all 
input values of the simulation. 



4-. 2. Variation of the Block Size 

CUDA allows for choosing the size of blocks with a maximum of 512 or 1024 
threads per block, for compute capability 1.2 and 2.0 hardware, respectively. In 
many applications, large blocks with many parallel threads are the best choice. 
Since CUDA hides memory access latencies of a thread by handling other threads 
meanwhile, small blocks are usually not the best option. For different inputs a 
block size of 128 had the best performance. Figure [Tl] shows the run-time over 
the block size for a system of 500,000 ions and electrons, respectively. These 
results were obtained with the same block size for all different kernels, but also 
every single module showed a similar behavior. 
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Figure 11: Execution time of a single cycle for different block sizes. Simulation runs with 
500,000 ions and electrons, respectively at a pressure of 10 Pa. 
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4-3. Electrical Grid Size Dependence 

In this section the number of grid points A^grid of the electric field lattice 
is changed. The total number of particles and the number of grid points per 



sorting cell A^gc remain constant. Figure 12 shows a nearly linear dependence of 
the execution time on A^grid- By increasing A'gj-id, also the CUDA kernel grid size 
rises and a larger number of CUDA blocks has to be called. All dependencies 
are linear, small variances in figure [12] can be explained by changing number of 
particles per sorting cell. This behavior is different from CPU implementations, 
where run-time is nearly independent of the electric field grid size. 

4.4- Speed-up to CPU 

For comparison to classical CPU approaches, all diagnostics in CPU code 
xpdpl |23] were removed, so only the plain PIC algorithm remains. For testing 
issues we use a GTX480 GPU and a single core of an Intel 17 870 CPU running 
at 2.93 GHz. These two devices are not only comparable with regard to up- 
to-dateness when establishing this study, but also in respect of their costs. For 
the test case of 500.000 ions and electrons (respectively) at a pressure of 20 Pa, 
a single time step takes about 13.70 ms on the CPU and about 1.09 ms on the 
GPU. 
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Figure 13: Speed-up of the fine-sorted PIC algorithm, compared to a CPU (single core) 
approach as a function of the number of super-particles for different pressures. N^^i^ = 512. 
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Figure 13 shows the speed-up of the fine-sorted GPU algorithm compared to 
the CPU code for three different pressures. Speed-up is calculated by dividing 
the CPU run-time by the GPU run-time. For increasing number of particles 
■^totai the system behaves as expected: low number of particles can not uti- 
lize the GPU fully, so increasing A'^totai also enhances the speed-up. For lower 
numbers of particles (in the range of a few thousands), overhead from calling 
functions on the GPU and data transfer between CPU and GPU memory be- 
comes more and more important. Hence the CPU code can become faster than 
the GPU for very small systems. Since PIC statistics improve with larger sys- 



tems [TT], we attended to larger systems in our study. Functions in figure 13 



are not saturated in respect of iVtotai, so for very large systems even higher 
speed-ups than 20 can be expected. 

Increasing the pressure from 1 Pa to 20 Pa slows down the GPU algorithm 
for high values of iVtotai- This is not obvious, since higher collision rates result 
in less branches and thereby advantages for the GPU. At a pressure of 1 Pa, 
Monte-Carlo collisions only take a negligible fraction of a cycle's execution time, 
so do not influence the speed-up. For increasing pressure, the collision module 
becomes more and more important. For a pressure of 20 Pa the speed-up of just 
the Monte-Carlo collisions is about 11.5 for a system of 1,000,000 electrons and 
ions, respectively. Now the total speed-up for 1 Pa (about 15.2) is higher than 
this value, so if collisions become more important, the whole speed-up decreases. 

An increase in pressure from 20 Pa to 100 Pa makes the simulation slightly 
faster. Again, the collisions become more important regarding the run-time of 
a cycle, but this time the collision module is more than 19 times faster than the 
1 Pa case. This results in an overall increase in speed. 

As on the CPU, most of the time is spend for all particle related tasks, 
pusher, charge density assignment and for high pressure also the Monte-Carlo 
module. Fine-sorting of particles of course also takes a considerable amount of 
time. 



5. Summary and Conclusion 

A fine-sorting particle-in-cell code, running on a single graphics processing 
unit was implemented, using Nvidia's CUDA environment. For testing we used 
newest generation GPU and CPU. All in all for a normal range of input pa- 
rameters, speed-ups of about 10-20 compared to a classical CPU code could be 
observed. A major advantage of the fine-sorting algorithm is that it allows for 
a straightforward implementation of binary collisions. 

The code is not bounded to any special (small) grid size. Thus, algorithms 
can be used for two-dimensional PIC codes, except for the field solving. Due 
to large numbers of particles and an efficient two-dimensional field solver, much 
higher speed-ups than in the one-dimensional case can be expected. 
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